Automatic Thesaurus Generation from Raw Text using Knowledge-Poor Techniques

Author

  • Gregory Grefenstette
Abstract

In addition to showing how lexical units are related within a field, domain-specific thesauri give an idea of what subjects are important to that field and are thus useful at many points in an information system. The major impediment to creation of thesauri has been the cost of their manual creation. We present here a number of automatic techniques that jointly produce a first draft of a thesaurus from any domain-defining collection of text. The techniques are knowledge-poor in that no domain knowledge is required for their use. We have successfully applied these techniques to over twenty corpora ranging from 1 to 6 megabytes. Results from the thesaurus produced from a collection of medical abstracts will also be presented here.


Similar resources

Using Hearst's Rules for the Automatic Acquisition of Hyponyms for Mining a Pharmaceutical Corpus

Fully Automatic Thesaurus Generation (ATG) seeks to generate useful thesauri by mining a corpus of raw text. A number of statistical approaches, based on term co-occurrence, exist for this, but in general they are only able to estimate the strength of the relationship between two terms, not its nature. In this paper we implement Hearst's method of discovering the hyponymy relations which are t...


Information Retrieval Tasks

Techniques of automatic natural language processing have been under development since the earliest computing machines, and in recent years these techniques have proven to be robust, reliable and efficient enough to lead to commercial products in many areas. The applications include machine translation, natural language interfaces and the stylistic analysis of texts but NLP techniques have also ...


Evaluation Techniques For Automatic Semantic Extraction: Comparing Syntactic And Window Based Approaches

As large on-line corpora become more prevalent, a number of attempts have been made to automatically extract thesaurus-like relations directly from text using knowledge poor methods. In the absence of any specific application, comparing the results of these attempts is difficult. Here we propose an evaluation method using gold standards, i.e., pre-existing hand-compiled resources, as a means of...



A Method for Refining Automatically-Discovered Lexical Relations: Combining Weak Techniques for Stronger Results

Knowledge-poor corpus-based approaches to natural language processing are attractive in that they do not incur the difficulties associated with complex knowledge bases and real-world inferences. However, these kinds of language processing techniques in isolation often do not suffice for a particular task; for this reason we are interested in finding ways to combine various techniques and improve thei...




Publication date: 1993